In this exercise, we will require the tidyverse, janitor, and gt
packages. We will explore a data file and introduce a few data
manipulation options for data cleaning.
library(tidyverse)
library(janitor)
library(gt)
The goal of this exercise is to explore and consider how to clean a messy dataset. The Metropolitan Museum of Art in New York City maintains a database of more than 470,000 artworks. For the purposes of this exercise, we are going to focus on a small sample of objects in a file which requires some data cleaning.
This exercise is structured to encourage you to perform and explore each cleaning step separately and then combine them at the end. In practice, you are welcome to add each step as you go.
The file is called MetUnclean.csv and you can read the file in using the read_csv function. Choose `met_unclean` as the name for the data frame if you want to be consistent with the rest of the exercise and the solutions we provide.
met_unclean <- read_csv("MetUnclean.csv")
Rows: 11 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Department, Object Title, Artist_Name, Artist_Nationality, Medium
dbl (2): Artist_Birth_Year, Object_Age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
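As a minimal sketch of how read_csv treats column names and types, the code below writes a tiny stand-in file to a temporary location and reads it back. The two rows and the names demo_path and demo are made up for illustration and are not part of MetUnclean.csv; show_col_types = FALSE is the option the message above suggests for quieting the column specification.

```r
library(readr)

# Hypothetical two-row stand-in for the real file (illustration only)
demo_path <- tempfile(fileext = ".csv")
writeLines(c("Object Title,Object_Age",
             "Example A,120",
             "Example B,9999"),
           demo_path)

# show_col_types = FALSE quiets the column specification message
demo <- read_csv(demo_path, show_col_types = FALSE)

names(demo)            # the space in "Object Title" is kept as-is
class(demo$Object_Age) # numbers are read in as numeric
```

Because read_csv keeps column names exactly as they appear in the file, any spaces in the header row carry through to the data frame.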
One of the object titles doesn’t seem quite right. What do you think might have happened?
While it is not possible to know this with the information available, this object had a title in Japanese script that could not be properly rendered in our file format.
Use glimpse() or the RStudio data view to see the
names of the variables in the data file. One variable name includes a
space. Write some code to remove this space, to make it easier to refer
to that variable.
Having spaces in a variable name requires the use of backticks whenever that variable is referred to, so a helpful step is to remove the space going forward.
We will use the rename() function to do this. You might like to replace the space with an underscore.
Remember to save the result as a new data frame with a different name from the original data, e.g. met_clean.
met_clean <- met_unclean %>%
  rename(Object_Title = `Object Title`)
glimpse(met_clean)
Rows: 11
Columns: 7
$ Department <chr> "Drawings and Prints", "European Paintings", "Europ…
$ Object_Title <chr> "Petrarch's Laura", "A Hare and Birds", "Alexander …
$ Artist_Name <chr> "Enea Vico", "Jan Fyt", "Pietro Testa", "Thomas Wij…
$ Artist_Nationality <chr> "Italian", "Flemish", "Italian", "Dutch", "French",…
$ Artist_Birth_Year <dbl> 1523, 1611, 1612, 1616, 1741, 1838, 1886, 1894, 190…
$ Object_Age <dbl> 477, 389, 375, 404, 234, 132, 97, 106, 62, 9999, 33
$ Medium <chr> "Engraving", "Oil on canvas", "Oil on canvaSS", "Et…
Note that we could also use the clean_names function in
the janitor package to change all the variable names in a
systematic way (e.g. remove all spaces) for an entire data frame.
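A minimal sketch of that alternative is shown here on a made-up one-row data frame (the object demo and its values are hypothetical, chosen only to mirror the style of the column names in this exercise).

```r
library(dplyr)
library(janitor)

# Hypothetical one-row data frame with the same style of column names
demo <- tibble(`Object Title` = "Example A",
               Artist_Name = "Example Artist",
               Object_Age = 120)

# Without cleaning, the name containing a space needs backticks
demo %>% select(`Object Title`)

# clean_names() converts every name to lower-case snake_case by default
demo_clean <- demo %>% clean_names()
names(demo_clean)
```

Note that the default setting also lower-cases the names, so the column would become object_title rather than the Object_Title used in the rest of this exercise.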
Use the replace function within the mutate function to correct the misspelled value in the Medium variable. Produce a frequency table using tabyl to check the results.
met_unclean %>%
  tabyl(Medium) %>%
  adorn_pct_formatting() %>%
  gt()
| Medium | n | percent |
|---|---|---|
| Engraving | 1 | 9.1% |
| Etching | 1 | 9.1% |
| Graphite, ink, and watercolor | 1 | 9.1% |
| Oil on canvaSS | 1 | 9.1% |
| Oil on canvas | 2 | 18.2% |
| Pewter | 1 | 9.1% |
| Polychrome woodblock print | 1 | 9.1% |
| Red chalk | 1 | 9.1% |
| Silver dye bleach print | 1 | 9.1% |
| leather | 1 | 9.1% |
met_clean <- met_unclean %>%
  mutate(Medium = replace(Medium, Medium == "Oil on canvaSS", "Oil on canvas"))
met_clean %>%
  tabyl(Medium) %>%
  adorn_pct_formatting() %>%
  gt()
| Medium | n | percent |
|---|---|---|
| Engraving | 1 | 9.1% |
| Etching | 1 | 9.1% |
| Graphite, ink, and watercolor | 1 | 9.1% |
| Oil on canvas | 3 | 27.3% |
| Pewter | 1 | 9.1% |
| Polychrome woodblock print | 1 | 9.1% |
| Red chalk | 1 | 9.1% |
| Silver dye bleach print | 1 | 9.1% |
| leather | 1 | 9.1% |
It is always worth checking errors with the source, to ensure the correction is appropriate.
The advantage of this approach for correction is that it is well-documented, reproducible and easy to amend.
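To see what replace is doing inside mutate, it can help to try it on a plain vector first. The sketch below uses a short vector of values taken from the frequency table above; the object names medium and medium_fixed are chosen for illustration only.

```r
# A short character vector standing in for the Medium column
medium <- c("Engraving", "Oil on canvas", "Oil on canvaSS", "Etching")

# The logical test marks which elements should be replaced
medium == "Oil on canvaSS"

# replace() swaps only the marked elements and leaves the rest untouched
medium_fixed <- replace(medium, medium == "Oil on canvaSS", "Oil on canvas")
medium_fixed

# Tabulate to confirm the misspelled value is gone
table(medium_fixed)
```

Because the condition is an exact match, only the single misspelled entry changes; a different misspelling would need its own condition.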
Use the summarise function to look at the object age variable, as it is numeric. Use na_if within the mutate function to ensure the missing data point is stored in a more useful form.
met_unclean %>%
  summarise(Mean = mean(Object_Age),
            SD = sd(Object_Age),
            Min = min(Object_Age),
            Med = median(Object_Age),
            Max = max(Object_Age),
            n = n()) %>%
  gt()
| Mean | SD | Min | Med | Max | n |
|---|---|---|---|---|---|
| 1118.909 | 2949.388 | 33 | 234 | 9999 | 11 |
met_clean <- met_unclean %>%
  mutate(Object_Age = na_if(Object_Age, 9999))
met_clean %>%
  summarise(Mean = mean(Object_Age, na.rm = TRUE),
            SD = sd(Object_Age, na.rm = TRUE),
            Min = min(Object_Age, na.rm = TRUE),
            Med = median(Object_Age, na.rm = TRUE),
            Max = max(Object_Age, na.rm = TRUE),
            n = n()) %>%
  gt()
| Mean | SD | Min | Med | Max | n |
|---|---|---|---|---|---|
| 230.9 | 165.7645 | 33 | 183 | 477 | 11 |
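The sketch below repeats this step on a plain vector, using the Object_Age values shown in the glimpse() output earlier, to illustrate why na.rm is needed once the placeholder becomes NA. It also shows one possible way to count only the non-missing values, since n() counts every row and so still reports 11. The object names ages and ages_clean are chosen for illustration.

```r
library(dplyr)

# Object_Age values as shown in the glimpse() output earlier
ages <- c(477, 389, 375, 404, 234, 132, 97, 106, 62, 9999, 33)

# na_if() turns the 9999 placeholder into a proper missing value
ages_clean <- na_if(ages, 9999)

mean(ages_clean)                 # NA: one missing value makes the mean missing
mean(ages_clean, na.rm = TRUE)   # mean of the ten recorded ages
length(ages_clean)               # counts every element, like n()
sum(!is.na(ages_clean))          # counts only the non-missing ages
```

If the number of recorded ages is wanted in the summary table, sum(!is.na(Object_Age)) could be used inside summarise in place of, or alongside, n().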
This should be one pipeline that starts with the met_unclean data frame and produces a clean data frame called met_clean.
met_clean <-
  met_unclean %>%
  rename(Object_Title = `Object Title`) %>%
  mutate(Medium = replace(Medium, Medium == "Oil on canvaSS", "Oil on canvas"),
         Object_Age = na_if(Object_Age, 9999))
glimpse(met_clean)
Rows: 11
Columns: 7
$ Department <chr> "Drawings and Prints", "European Paintings", "Europ…
$ Object_Title <chr> "Petrarch's Laura", "A Hare and Birds", "Alexander …
$ Artist_Name <chr> "Enea Vico", "Jan Fyt", "Pietro Testa", "Thomas Wij…
$ Artist_Nationality <chr> "Italian", "Flemish", "Italian", "Dutch", "French",…
$ Artist_Birth_Year <dbl> 1523, 1611, 1612, 1616, 1741, 1838, 1886, 1894, 190…
$ Object_Age <dbl> 477, 389, 375, 404, 234, 132, 97, 106, 62, NA, 33
$ Medium <chr> "Engraving", "Oil on canvas", "Oil on canvas", "Etc…
Remember, if you don’t assign all your hard work to an object, it won’t be saved anywhere.
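A small sketch of the difference, using a made-up one-row data frame (demo and demo_clean are hypothetical names for illustration):

```r
library(dplyr)

# Hypothetical one-row data frame for illustration
demo <- tibble(`Object Title` = "Example A", Object_Age = 9999)

# Not assigned: the result is printed, but demo itself is unchanged
demo %>% mutate(Object_Age = na_if(Object_Age, 9999))
demo$Object_Age

# Assigned: the cleaned version is stored under a new name
demo_clean <- demo %>% mutate(Object_Age = na_if(Object_Age, 9999))
demo_clean$Object_Age
```

The first pipeline only displays its result, while the second keeps it available for later steps.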
Download the airport screening file used in lectures and attempt to perform your own cleaning exercise.
© 2023 Statistical Consulting Centre, The University of Melbourne.